YouTube Data Analysis and 2019 Data Predictions

CMSC320

By Rachel Taskale

Social media and the content that we consume reflect the current trends and interests of the population. As a person who grew up during the birth and growth of YouTube, I spent most of my childhood watching YouTube videos. This helped to shape my interests, views, and ideals throughout my life, and the same goes for many others.

However, the YouTube Trending page is a part of YouTube that does not reflect just my own interests the way the YouTube homepage does. It reflects an entire population of people, determined by the country.

In this Data Science Tutorial, I will walk you through the Data Science pipeline of data curation and management, exploratory data analysis, hypothesis testing and machine learning.

Disclaimer to Grader: One of the tables that I used gets updated daily, so it may have slightly different numbers than what I have. However, unless something "breaks the internet" (a.k.a. goes viral) before my grading is complete, all of the graphs should be reproducible.

About The Data:

The first dataset that I'm using for the final project is 2017-2018 data on YouTube Trending Videos from Kaggle. The reference page for the data was: https://www.kaggle.com/datasnaek/youtube-new?select=USvideos.csv

The second dataset that I am using for the final project is 2020-2021 data on YouTube Trending Videos, also from Kaggle. I used the US data. The reference page for the data was: https://www.kaggle.com/rsrishav/youtube-trending-video-dataset

The way I interpreted the data: Upon observing the dataset content, I took everything in relation to when the video was posted to the trending page, rather than what the current stats are. That is, I assumed that the view count, likes, dislikes, and comments recorded for a video are frozen at the moment it is put on the trending page.

Citations:

J, M. (2019, June 2). Trending YouTube Video Statistics. Kaggle. https://www.kaggle.com/datasnaek/youtube-new?select=USvideos.csv

Sharma, R. (2021, May 17). YouTube Trending Video Dataset (updated daily). Kaggle. https://www.kaggle.com/rsrishav/youtube-trending-video-dataset

Category ID Subject
1 Film & Animation
2 Autos & Vehicles
10 Music
15 Pets & Animals
17 Sports
18 Short Movies
19 Travel & Events
20 Gaming
21 Videoblogging
22 People & Blogs
23 Comedy
24 Entertainment
25 News & Politics
26 Howto & Style
27 Education
28 Science & Technology
29 Nonprofits & Activism
30 Movies
31 Anime/Animation
32 Action/Adventure
33 Classics
34 Comedy
35 Documentary
36 Drama
37 Family
38 Foreign
39 Horror
40 Sci-Fi/Fantasy
41 Thriller
42 Shorts
44 Trailers

Cleaning the Data:

Since I decided to use two different tables, I needed to combine them into one large table for convenience. Upon observation, it was clear that the two tables were in different formats, with different column names representing the same subjects. I therefore created a new table with the columns in an order that I found usable and combined all the data.

As a part of the cleaning, I standardized the trending date data into datetime type and removed duplicate data.

  1. I created all the new columns in the total_data table. I also converted the dates from whatever format they were in before to datetime so they were easier to use.
  1. I merged the two tables together by creating a new pandas table with standard names. The difficulty was that the two tables used similar but different names for the same columns, and the columns were in different orders, which led to a lot of hard-coded wrangling.
  1. When I initially entered the data into the table, I made sure to enter the 2020-2021 data first because it is the most up to date. I then removed the later-added (2017-2018) duplicates to avoid duplicate rows.
  1. Following the data cleaning, I checked to make sure that there were no null values. Using the .isnull() function, the only place where there appeared to be null values was in the description.

    I decided to convert those null values to "N/A" to create a standard value for a missing description.
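The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on two tiny made-up tables, not the original notebook code: the raw column names (channelTitle, categoryId, view_count), the two date formats, and the choice of video_id as the duplicate key are all assumptions about the Kaggle files.

```python
import pandas as pd

# Tiny stand-ins for the two Kaggle CSVs. The column names and date formats
# here are assumptions about the raw files, not taken from this tutorial.
old = pd.DataFrame({
    "video_id": ["a1", "b2"],
    "trending_date": ["17.14.11", "18.01.02"],  # assumed yy.dd.mm
    "channel_title": ["Channel A", "Channel B"],
    "category_id": [10, 24],
    "views": [1000, 2000],
    "description": ["first upload", None],
})
new = pd.DataFrame({
    "video_id": ["c3", "a1"],
    "trending_date": ["2020-08-12T00:00:00Z", "2020-08-13T00:00:00Z"],
    "channelTitle": ["Channel C", "Channel A"],
    "categoryId": [20, 10],
    "view_count": [3000, 5000],
    "description": ["new upload", "first upload"],
})

# Give the 2020-2021 columns the same names as the 2017-2018 columns.
new = new.rename(columns={
    "channelTitle": "channel_title",
    "categoryId": "category_id",
    "view_count": "views",
})

# Parse each date format separately so both end up as naive datetimes.
old["trending_date"] = pd.to_datetime(old["trending_date"], format="%y.%d.%m")
new["trending_date"] = pd.to_datetime(new["trending_date"], utc=True).dt.tz_localize(None)

# Newest data first, so drop_duplicates keeps the 2020-2021 copy of a video.
# Using video_id as the duplicate key is an assumption.
total_data = pd.concat([new, old], ignore_index=True)
total_data = total_data.drop_duplicates(subset="video_id", keep="first")
total_data = total_data.reset_index(drop=True)

# Description is the only column with nulls; standardize them to "N/A".
total_data["description"] = total_data["description"].fillna("N/A")
print(total_data.isnull().sum())
```

Note that deduplicating on video_id alone keeps a single row per video even if it trended on several days; a key such as video_id plus trending_date would keep one row per day instead.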

Here is the head of the table, showing what the data now looks like.

EDA

For this section, I first wanted to see how every variable correlates with every other variable to determine what I wanted to look at.

As you can see from the graph above, the factors that have the strongest relationships with each other are ratings_disabled + comments_disabled, comment_count + likes, comment_count + dislikes, dislikes + likes, views + likes, views + dislikes, and views + comment_count. For these reasons, I want to initially analyze the relationship between likes, dislikes, comments, and views for the trending videos.
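As a minimal sketch of how such a pairwise correlation table can be computed with pandas, assuming the numeric and boolean columns are named as in the rest of this tutorial (the values below are made up, and the heatmap plotting step is left out):

```python
import pandas as pd

# Made-up sample rows; column names follow the ones used in this tutorial.
videos = pd.DataFrame({
    "views": [100, 200, 300, 400],
    "likes": [10, 20, 30, 40],
    "dislikes": [4, 3, 2, 1],
    "comments_disabled": [False, True, False, True],
})

# Cast booleans to numbers so every column takes part, then compute
# the Pearson correlation of each column against every other column.
corr = videos.astype(float).corr()
print(corr.round(2))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.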

First, I'm going to analyze the channels that appear most frequently on YouTube trending overall. This will give us a better idea of who the leading YouTube channels are.
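One way to count channel appearances is sketched below. It assumes a channel_title column and counts every trending row, so a video that trends on several days is counted several times; the channel names are placeholders.

```python
import pandas as pd

# Placeholder channel names; one row per trending entry.
total_data = pd.DataFrame({
    "channel_title": ["Channel A", "Channel B", "Channel A",
                      "Channel C", "Channel A", "Channel B"],
})

# value_counts sorts by frequency, so head(10) gives the ten
# channels that appear most often.
top_channels = total_data["channel_title"].value_counts().head(10)
print(top_channels)
```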

At first glance, it is clear that YouTube's main focus is on high-grossing areas such as sports, Netflix, and talk shows, which tend to generate a lot of content and have high view counts.

Here, I took the top 10 most watched YouTube trending videos across all the data and plotted the number of views, likes, dislikes, and comments. It is important to note my initial assumption about the data: these numbers were recorded at the time that the video was added to trending. Just from observation, you can see that both BTS' Dynamite and Cardi B's WAP were not added to trending until they had hundreds of millions of views. One inference that you can make from this is that international music and controversial music take much longer to be added to YouTube trending, both to protect YouTube's image and because of what YouTube believes the American population will want to watch.

Specifically, in the case of Cardi B and Megan Thee Stallion's WAP music video, it can be inferred that YouTube did not want to promote such an explicit video on a technically "family friendly" website whose audience is a society with predominantly Christian and traditional values.

Also note that there appear to be quite a number of K-Pop groups that were not added to trending until they had a significant number of views, including on a song with American singer Selena Gomez.
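A small helper for pulling the top videos by any metric might look like the sketch below. The function name top_videos and the sample rows are hypothetical; the column names follow the ones used in this tutorial.

```python
import pandas as pd

# Made-up sample rows standing in for the combined table.
total_data = pd.DataFrame({
    "title": ["Video A", "Video B", "Video C", "Video D"],
    "views": [500, 100, 900, 300],
    "likes": [50, 80, 20, 10],
    "dislikes": [5, 1, 9, 3],
    "comment_count": [7, 2, 4, 1],
})

def top_videos(df, metric, n=10):
    """Return the n rows with the largest value of `metric`."""
    columns = ["title", "views", "likes", "dislikes", "comment_count"]
    return df.nlargest(n, metric)[columns]

most_viewed = top_videos(total_data, "views", n=2)
most_liked = top_videos(total_data, "likes", n=2)
print(most_viewed)
```

The same helper covers the likes and dislikes rankings by changing the metric argument, and DataFrame.nsmallest works the same way for the least viewed videos.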

Next, I'm going to analyze the top trending videos based on the number of likes.

It appears that nearly every video on the top-likes graph is by a K-Pop group, with BTS taking up nearly the entire graph by themselves. From this we can infer that American YouTube watchers really like K-Pop.

Now let's look at the top videos based on dislikes.

Based on the graph, it is evident that the highest viewed videos also tend to get the most dislikes and likes. However, one important thing to note is the magnitude of the cancel culture response to Logan Paul for filming in Japan's suicide forest: https://www.nbcnews.com/pop-culture/pop-culture-news/youtuber-logan-paul-sued-over-suicide-forest-video-n1252610 as shown by the dislikes on his "So Sorry" video and on his return to YouTube, "LOGAN PAUL IS BACK".

Finally, I also wanted to take a look at the least viewed videos that were able to make it to the trending page.

The ten videos with the fewest views are interesting because a number of them did not have any views, yet were still put on trending and also have likes and dislikes. Most likely this is due to inconsistencies in the table data. However, it is also possible that the YouTube algorithm predicted that some of these videos would go viral and therefore put them on the trending page.

Another question to ask is how two of the videos that did not have any views received likes, dislikes, and comments. The most probable answer is bots used to make a video appear more popular.

Here is a pie chart to demonstrate the breakdown of the most watched trending video categories. If you refer to the category table given in the intro, you can see that the top 3 most watched categories on YouTube are Music, Entertainment, and News & Politics.
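The breakdown behind a chart like this can be sketched as below. It assumes that "most watched" means total views summed per category_id, which is one possible reading; the view counts are made up, and the names come from the category table in the intro.

```python
import pandas as pd

# A few entries from the category table in the intro.
category_names = {10: "Music", 20: "Gaming", 24: "Entertainment",
                  25: "News & Politics"}

# Made-up view counts for illustration.
total_data = pd.DataFrame({
    "category_id": [10, 10, 24, 25, 20],
    "views": [500, 300, 400, 200, 100],
})

# Sum views per category, largest first, then convert to shares of the total.
views_by_category = (
    total_data.groupby("category_id")["views"].sum().sort_values(ascending=False)
)
share = views_by_category / views_by_category.sum()
share.index = share.index.map(category_names)
print(share.round(3))
```

The resulting shares are what a pie chart would draw, for example with share.plot.pie() when matplotlib is installed.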

Here are some visualizations of current 2020-2021 trends. As you can see, Lil Nas X's Montero was able to make it onto YouTube trending despite having a music video that was incredibly controversial among the religious community in America.

Here are some visualizations of the 2017-2018 trends.

Next, I wanted to observe how frequently words occur in tags, broken down by year.

These are the top words used in tags overall for 2017-2021. As you can see, the majority of the YouTube population in America is interested in a variety of topics, from games such as Among Us and Apex Legends to funny videos, TikTok, and music.

This is a narrower view that uses only the data from 2017-2018. There are still some similarities, but most of the trends are more specific to those years, including Black Panther, Star Wars, and Will Smith's appearance in YouTube Rewind: https://www.youtube.com/watch?v=YbJOTdZBX1g&t=2s

This is a narrower view that uses only the data from 2020-2021. There are still some similarities, but there is more emphasis on Among Us, Apex Legends, TikTok, and James Charles.
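A minimal sketch of counting tag frequencies, overall and for a subset of years, is shown below. It assumes the tags column is a single pipe-separated string per video, with a "[none]" placeholder for untagged videos; both are assumptions about the raw data. The helper tag_counts is hypothetical, and it counts whole tags rather than individual words.

```python
from collections import Counter

import pandas as pd

# Made-up rows; tags are assumed to be pipe-separated strings.
total_data = pd.DataFrame({
    "trending_date": pd.to_datetime(
        ["2018-01-05", "2018-02-01", "2020-10-01", "2021-01-15"]),
    "tags": ['black panther|"marvel"', "[none]",
             "among us|gaming", "Among Us|tiktok"],
})

def tag_counts(df):
    """Count how often each lowercased tag appears in df['tags']."""
    counts = Counter()
    for tags in df["tags"].dropna():
        for tag in tags.split("|"):
            tag = tag.strip().strip('"').lower()
            if tag and tag != "[none]":  # skip the assumed "no tags" marker
                counts[tag] += 1
    return counts

overall = tag_counts(total_data)
recent = tag_counts(total_data[total_data["trending_date"].dt.year >= 2020])
print(overall.most_common(5))
```

Filtering the table before calling the helper gives the per-period counts, and the same filter on a boolean column such as ratings_disabled would restrict the counts to those videos.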

Finally, I wanted to take the time to analyze the relationship between tags and ratings_disabled. From this we can gain more insight into what topics may have caused a channel to turn off ratings on their video.

I did the same as before, but with comments_disabled instead.

Part 4: Linear Regression Model

Since I decided to use two separate tables covering different years of YouTube data, we are missing all the 2019 data on trending videos. For this project, I will be predicting the 2019 data from the 2017-2018 and 2020-2021 data. Through this we can better observe the overall trends in what content YouTube tends to deem trending.

First, I decided to import all the data into a new table, dropping the columns I felt were unneeded for the linear regression.

I applied a linear regression model to train on the data provided and predict trends in each category for 2019. Each category is labeled with its corresponding category ID.

From this data, we can observe that the 2019 predictions continue the trends mapped from 2017-2021. It is important to note that there are very obvious trends suggesting that certain categories with positive slopes, such as 19 and 15, will overtake others in the future. This may help predict what the future major categories of YouTube will be and what people are mainly interested in.
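The per-category fit can be sketched as below. This is an illustration under stated assumptions, not the original model: it assumes the quantity being predicted is the yearly number of trending videos per category, uses made-up counts, and uses NumPy's least-squares line fit (np.polyfit with degree 1) as a stand-in for whichever regression library the notebook used.

```python
import numpy as np
import pandas as pd

# Made-up yearly counts per category; 2019 is the missing year.
yearly = pd.DataFrame({
    "year": [2017, 2018, 2020, 2021] * 2,
    "category_id": [10] * 4 + [20] * 4,
    "count": [10, 20, 40, 50, 5, 5, 5, 5],
})

def predict_2019(group):
    """Fit count = slope * (year - 2019) + intercept and evaluate at 2019."""
    x = group["year"].to_numpy(dtype=float) - 2019  # center years for stability
    y = group["count"].to_numpy(dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    return intercept  # value of the fitted line at year 2019

predictions = {
    category: predict_2019(group)
    for category, group in yearly.groupby("category_id")
}
print(predictions)
```

Centering the years on 2019 makes the intercept the 2019 prediction directly and avoids fitting with very large x values.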

Next, I did a linear regression on years and likes to see if there was any relationship.

In 2020, there were significantly more likes than in past years. Given the context of the pandemic, it is plausible that more people were watching YouTube videos because everything was remote and online. This carries over into the current 2021 data as well. However, the linear regression model fails to account for an unexpected event like the pandemic. The regression line simply shows a gradual increase in likes over time, so it does not capture the true nature of the spike in likes.

Following this, I made a violin plot of likes vs. year to further show that the number of likes was much higher than in past years.
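As a rough sketch of such a test, scipy.stats.linregress fits the line and reports a two-sided p-value for the null hypothesis that the slope is zero. The like counts below are made up, and using SciPy here is an assumption rather than the notebook's confirmed method.

```python
import pandas as pd
from scipy import stats

# Made-up like counts for illustration; two videos per year.
videos = pd.DataFrame({
    "year": [2017, 2017, 2018, 2018, 2020, 2020, 2021, 2021],
    "likes": [100, 120, 110, 130, 400, 420, 380, 410],
})

# Ordinary least-squares fit of likes against year.
result = stats.linregress(videos["year"], videos["likes"])
predicted_2019 = result.slope * 2019 + result.intercept

print(f"slope={result.slope:.1f}, r={result.rvalue:.2f}, p={result.pvalue:.4f}")
print(f"predicted likes for 2019: {predicted_2019:.0f}")
```

A small p-value only says that a flat line is unlikely; it does not show that the growth is linear, which matters when the data contains a sudden jump.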

Conclusion

The purpose of this project was not necessarily to uncover any new information, but rather to show how the US YouTube trending page reflects many of our past and current trends in America. It also showed how we can predict the 2019 data, despite the impact of the coronavirus on the US.

In our hypothesis testing, I decided to pick variables based on how significant they were in the correlation matrix, ultimately deciding on likes over time. However, this may not have been the best decision because I failed to remove the 2021 data. Since 2021 is not complete, we were not able to determine the exact relationship between likes and specific years.

In addition, the hypothesis testing and ML that included the 2020-2021 data reflect the overall trends in America poorly because they fail to take the pandemic into consideration.

Here were the main resources I used: